Algorithm

Hierarchical clustering can be divided into two main types: agglomerative and divisive.

Agglomerative clustering: It’s also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes). This procedure is iterated until all points are member of just one single big cluster (root) (see figure below). The result is a tree which can be plotted as a dendrogram.

Divisive hierarchical clustering: It’s also known as DIANA (Divise Analysis) and it works in a top-down manner. The algorithm is an inverse order of AGNES. It begins with the root, in which all objects are included in a single cluster. At each step of iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster (see figure below).

Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at identifying large clusters.

Linkage methods

The merging or the division of clusters is performed according some (dis)similarity measure. In R softwrare, the Euclidean distance is used by default to measure the dissimilarity between each pair of observations.

As we already know, it’s easy to compute dissimilarity measure between two pairs of observations. It’s mentioned above that two clusters that are most similar are fused into a new big cluster.

A natural question is : How to measure the dissimilarity between two clusters of observations?

The most common types methods are:

Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.

Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.

Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.

Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.

Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged.

Complete linkage and Ward’s method are generally preferred.

Examples of linkage methods

summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
completelinkage <- hclust(dist(scale(USArrests)), 'complete')
plot(completelinkage)

# scale is generic function whose default method centers and/or scales the columns of a numeric matrix.
singlelinkage <- hclust(dist(scale(USArrests)), 'single')
plot(singlelinkage)

averagelinkage <- hclust(dist(scale(USArrests)), 'average')
plot(averagelinkage)

wardlinkage <- hclust(dist(scale(USArrests)), 'ward.D2')
plot(wardlinkage)

Exploring a dataset

data('USArrests')
dataclear <- na.omit(USArrests)
head(dataclear, n=6)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
desc_stats <- data.frame(
  Min = apply(dataclear, 2, min), # minimum
  Med = apply(dataclear, 2, median), # median
  Mean = apply(dataclear, 2, mean), # mean
  SD = apply(dataclear, 2, sd), # Standard deviation
  Max = apply(dataclear, 2, max) # Maximum
  )

desc_stats <- round(desc_stats, 1)

head(desc_stats)
##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0

the variables have a large different means and variances. This is explained by the fact that the variables are measured in different units; Murder, Rape, and Assault are measured as the number of occurrences per 100 000 people, and UrbanPop is the percentage of the state’s population that lives in an urban area.

They must be standardized (i.e., scaled) to make them comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one.

scaled_stats <- scale(dataclear)
head(dataclear)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

R functions for hierarchical clustering

There are different functions available in R for computing hierarchical clustering. The commonly used functions are:

hclust() [in stats package] and agnes() [in cluster package] for agglomerative hierarchical clustering (HC) diana() [in cluster package] for divisive HC

hclust() function

The simplified format is:

hclust(d, method = “complete”)

d => a dissimilarity structure as produced by the dist() function.

method => The agglomeration method to be used. Allowed values is one of “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.

# Dissimilarity matrix
d <- dist(scaled_stats, method = "euclidean")

# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )

# Plot the obtained dendrogram
plot(res.hc, cex = 0.6, hang = -1)

Agnes() and diana() functions

The R function agnes() (in cluster package) can be also used to compute agglomerative hierarchical clustering. The R function diana() (in cluster package) is an example of divisive hierarchical clustering.

Agglomerative Nesting (Hierarchical Clustering) agnes(x, metric = “euclidean”, stand = FALSE, method = “average”)

DIvisive ANAlysis Clustering diana(x, metric = “euclidean”, stand = FALSE)

x: data matrix or data frame or dissimilarity matrix. In case of matrix and data frame, rows are observations and columns are variables. In case of a dissimilarity matrix, x is typically the output of daisy() or dist(). metric: the metric to be used for calculating dissimilarities between observations. Possible values are “euclidean” and “manhattan”. stand: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable’s mean value and dividing by the variable’s mean absolute deviation method: The clustering method. Possible values includes “average”, “single”, “complete”, “ward”.

The function agnes() returns an object of class “agnes” which has methods for the functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().

The function diana() returns an object of class “diana” which has also methods for the functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().

Compared to other agglomerative clustering methods such as hclust(), agnes() has the following features:

It yields the agglomerative coefficient (see agnes.object) which measures the amount of clustering structure found Apart from the usual tree it also provides the banner, a novel graphical display (see plot.agnes).

R code for computing agnes

library("cluster")

# Compute agnes()
res.agnes <- agnes(scaled_stats, method = "ward")

# Agglomerative coefficient
res.agnes$ac
## [1] 0.934621
## [1] 0.934621
# Plot the tree using pltree()
pltree(res.agnes, cex = 0.6, hang = -1,
       main = "Dendrogram of agnes") 

# plot.hclust()
plot(as.hclust(res.agnes), cex = 0.6, hang = -1)

# plot.dendrogram()
plot(as.dendrogram(res.agnes), cex = 0.6, 
     horiz = TRUE)

R code for computing diana

# Compute diana()
res.diana <- diana(scaled_stats)

# Divise coefficient
res.diana$dc
## [1] 0.8514345
# Plot the tree
pltree(res.diana, cex = 0.6, hang = -1,
       main = "Dendrogram of diana")

# plot.hclust()
plot(as.hclust(res.diana), cex = 0.6, hang = -1)

# plot.dendrogram()
plot(as.dendrogram(res.diana), cex = 0.6, 
     horiz = TRUE)

Interpretation of the dendogram

In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.

The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are.

Note that, conclusions about the proximity of two observations can be drawn only based on the height where branches containing those two observations first are fused. We cannot use the proximity of two observations along the horizontal axis as a criteria of their similarity.

In order to identify sub-groups (i.e. clusters), we can cut the dendrogram at a certain height as described in the next section.

Cut the dendogram into different groups

The height of the cut to the dendrogram controls the number of clusters obtained. It plays the same role as the k in k-means clustering.

The function cutree() is used and it returns a vector containing the cluster number of each observation:

# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)
# Number of members in each cluster
table(grp)
## grp
##  1  2  3  4 
##  7 12 19 12
# Get the names for the members of cluster 1
rownames(scaled_stats)[grp == 1]
## [1] "Alabama"        "Georgia"        "Louisiana"      "Mississippi"   
## [5] "North Carolina" "South Carolina" "Tennessee"
## [1] "Alabama"        "Georgia"        "Louisiana"      "Mississippi"   
## [5] "North Carolina" "South Carolina" "Tennessee"

It’s also possible to draw the dendrogram with a border around the 4 clusters. The argument border is used to specify the border colors for the rectangles:

plot(res.hc, cex = 0.6)
rect.hclust(res.hc, k = 4, border = 2:5)

Using the function fviz_cluster() [in factoextra], we can also visualize the result in a scatter plot. Observations are represented by points in the plot, using principal components. A frame is drawn around each cluster.

library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_cluster(list(data=scaled_stats, cluster = grp))

The function cutree() can be used also to cut the tree generated with agnes() and diana() as follow:

# Cut agnes() tree into 4 groups
cutree(res.agnes, k = 4)
##  [1] 1 2 2 3 2 2 3 3 2 1 3 4 2 3 4 3 3 1 4 2 3 2 4 1 3 4 4 2 4 3 2 2 1 4 3
## [36] 3 3 3 3 1 4 1 2 3 4 3 3 4 4 3
# Cut diana() tree into 4 groups
cutree(as.hclust(res.diana), k = 4)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3

Comparing two dendograms

Package dendextend which contains many functions for comparing two dendrograms, including: dend_diff(), tanglegram(), entanglement(), all.equal.dendrogram(), cor.dendlist().

A random subset of the dataset will be used in the following example. The function sample() is used to randomly select 10 observations among the 50 observations contained in the data set

# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- data.frame(ss)

In the R code below, we’ll start by computing pairwise distance matrix using the function dist(). Next, hierarchical clustering (HC) is computed using two different linkage methods (“average” and “ward.D2”). Finally the results of HC are transformed as dendrograms:

# Compute distance matrix
res.dist <- dist(df, method = "euclidean")

# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")
dend1 <- as.dendrogram (hc1)
dend2 <- as.dendrogram (hc2)
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.8.0
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
dend_list <- dendlist(dend1, dend2)

Tanglegram

The function tanglegram() plots two dendrograms, side by side, with their labels connected by lines. It can be used for visually comparing two methods of Hierarchical clustering as follow:

tanglegram(dend1, dend2)

Note that “unique” nodes, with a combination of labels/items not present in the other tree, are highlighted with dashed lines.

The quality of the alignment of the two trees can be measured using the function entanglement().

tanglegram(dend1, dend2)

tanglegram(dend1, dend2,
  highlight_distinct_edges = FALSE, # Turn-off dashed lines
  common_subtrees_color_lines = FALSE, # Turn-off line colors
  common_subtrees_color_branches = TRUE, # Color common branches 
  main = paste("entanglement =", round(entanglement(dend_list), 2))
  )

# Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement coefficient corresponds to a good alignment

Correlation matrix between a list of dendogram

The function cor.dendlist() is used to compute “Baker” or “Cophenetic” correlation matrix between a list of trees.

# Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
##           [,1]      [,2]
## [1,] 1.0000000 0.9857437
## [2,] 0.9857437 1.0000000
# Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
##           [,1]      [,2]
## [1,] 1.0000000 0.9677821
## [2,] 0.9677821 1.0000000

The correlation between two trees can be also computed as follow:

# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.9857437
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.9677821

It’s also possible to compare simultaneously multiple dendrograms. A chaining operator %>% (available in dendextend) is used to run multiple function at the same time. It’s useful for simplifying the code:

# Subset data
set.seed(123)
ss <- sample(1:150, 10 )

# Create multiple dendrograms by chaining
dend1 <- df %>% dist %>% hclust("com") %>% as.dendrogram
dend2 <- df %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- df %>% dist %>% hclust("ave") %>% as.dendrogram
dend4 <- df %>% dist %>% hclust("centroid") %>% as.dendrogram

# Compute correlation matrix
dend_list <- dendlist("Complete" = dend1, "Single" = dend2,
                      "Average" = dend3, "Centroid" = dend4)
cors <- cor.dendlist(dend_list)

# Print correlation matrix
round(cors, 2)
##          Complete Single Average Centroid
## Complete     1.00   0.97    0.99     0.98
## Single       0.97   1.00    0.99     0.99
## Average      0.99   0.99    1.00     1.00
## Centroid     0.98   0.99    1.00     1.00
# Visualize the correlation matrix using corrplot package
library(corrplot)
## corrplot 0.84 loaded
corrplot(cors, "pie", "lower")

Hierarchical Clustering with IRIS data

Hierarchical clustering is an approach which builds a hierarchy from the bottom-up, and doesn’t require to specify the number of clusters beforehand.

This is usually represented by dendogram.

We can also determine how close two clusters are: - Complete linkage clustering: Find the maximum possible distance between points belonging to two different clusters - Single linkage clustering: Find the minimum possible distance between points belonging to two different clusters - Mean linkage clustering: Find all possible pairwise distances for points belonging to two different clusters and then calculate the average - Centroid linkage clustering: Find the centroid of each cluster and calculate the distance between centroids of two clusters

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# There are 3 species of iris flowers.

In order to perform hierarchical clustering, ‘hclust’ will be used. It requires to provide the data in the form of a distance matrix. This can be done by using ‘dist’.

clusters <- hclust(dist(iris[,3:4]))
plot(clusters)

# iris[,3:4] => return petal.length and petal.width

This generated a dendogram. Now we can decide on number of clusters. In this example, it could be 3 or 4. To cut number of clusters, ‘cutree’ will be used.

clustercut<- cutree(clusters,3)

Let’s compare this with original data

table(clustercut, iris$Species)
##           
## clustercut setosa versicolor virginica
##          1     50          0         0
##          2      0         21        50
##          3      0         29         0

setosa and virginica were successfuly classified into clusters. Versicolor was split into two clusters. In the graph below, I can see why.

library(ggplot2)

ggplot(iris, aes(Petal.Length, Petal.Width, color = iris$Species)) + 
  geom_point()

Other linkage method for clustering can be used e.g. mean linkage method:

clusters <- hclust(dist(iris[,3:4]), method = 'average')
plot(clusters)

clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
##           
## clusterCut setosa versicolor virginica
##          1     50          0         0
##          2      0         45         1
##          3      0          5        49

Algorithm did a much better job of clustering the data, only went wrong with 6 of the data points.

Determining Optimal Clusters

We can execute similar approaches for hierarchical clustering as with k-means clustering

Elbow Method

To perform the elbow method we just need to change the second argument in fviz_nbclust to FUN=hcut

I will used previously defined ‘scale_stats’

fviz_nbclust(scaled_stats, FUN = hcut, method = "wss")

Average Silhouette Method

To perform the average silhouette method we follow a similar process.

fviz_nbclust(scaled_stats, FUN=hcut, method = "silhouette")

Gap Statistics Method

gap_stat <- clusGap(scaled_stats, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

Heatmaps

By default, heatmap clusters by both rows and columns. It then reorders the resulting dendrograms according to mean. Setting Colv to false tells it not to reorder the columns, which will come in handy later. Let’s also turn off the default scaling across rows. We’ve already scaled across columns, which is the sensible thing to do in this case.

# cluster rows
hc.rows <- hclust(dist(scaled_stats))
plot(hc.rows)

# transpose the matrix and cluster columns
hc.cols <- hclust(dist(t(scaled_stats)))

# heatmap
heatmap(scaled_stats, Colv=F, scale='none')

# draw heatmap for first cluster, k determines number of cluster (if 2, dendogram is divided into 2 clusters and zoom-in)
heatmap(scaled_stats[cutree(hc.rows,k=2)==1,], Colv=as.dendrogram(hc.cols), scale='none')

# draw heatmap for second cluster
heatmap(scaled_stats[cutree(hc.rows,k=2)==2,], Colv=as.dendrogram(hc.cols), scale='none')